Goto

Collaborating Authors

 Agent Societies


Core-Halo Decomposition: Decentralizing Large-Scale Fixed-Point Problems

arXiv.org Machine Learning

We study solving large-scale fixed-point equation x = F(x) with decomposition. Standard strict decomposition assigns each agent a disjoint block and evaluates updates using only owned coordinates. For most operators, however, a block update may depend on variables outside the block. Truncating these dependencies by strict decomposition changes the mean operator and creates structural bias that cannot be removed by more samples, smaller stepsizes, or additional consensus. We therefore propose Core-Halo decomposition, which separates write ownership from read-only evaluation context: each agent updates its own core and reads from an overlapping halo. By aligning the Core-Halo decomposition with the blockdependence structure of F, the original fixed-point problem can be implemented faithfully in a decentralized multi-agent system. We further characterize the fundamental obstruction faced by strict decomposition through a Bellman closure condition and a blockwise bias lower bound, showing that local-only updates can alter the original fixed-point operator. Finally, we conduct extensive experiments across a range of application settings, and demonstrate that Core-Halo achieves near-centralized performance while retaining the parallelism benefits of decentralization.


Decentralized Diffusion Policy Learning for Enhanced Exploration in Cooperative Multi-agent Reinforcement Learning

arXiv.org Machine Learning

Cooperative multi-agent reinforcement learning (MARL) involves complex agent interactions and requires effective exploration strategies. A prominent class of MARL algorithms, decentralized softmax policy gradient (DecSPG), addresses this through energy-based policy updates. In practice, however, such energy-based policies are intractable to maintain and are commonly projected onto the Gaussian policy class. In this work, we show that the limited expressiveness of Gaussian policies severely hinders exploration in DecSPG, and this limitation worsens as the number of agents grows. To address this issue, we propose decentralized diffusion policy learning (DDPL), which parameterizes each agent's policy with a denoising diffusion probabilistic model, an expressive generative model that captures multi-modal action distributions for enhanced exploration. DDPL enables efficient online training of diffusion policies via importance sampling score matching (ISSM), a novel training method with theoretical guarantee. We evaluate DDPL on representative continuous-action MARL benchmarks, including multi-agent particle environment, multi-agent MuJoCo, IsaacLab, and JAX-reimplemented StarCraft multi-agent challenge, and observe consistently improved performance.


Bandit Learning in General Open Multi-agent Systems

arXiv.org Machine Learning

Recent developments in digital platforms have highlighted the prevalence of open systems, where agents can arrive and depart over time. While bandit learning in open systems has recently received initial attention, existing work imposes structural assumptions that are frequently violated in practice. A learning paradigm for general open systems creates fresh challenges: newly arriving agents induce endogenous non-stationarity; agent patterns determine how quickly information accumulates; and new agents make regret scale further with the time horizon. To this end, we formulate a unified open-system bandit problem with general dynamics, including heterogeneous rewards and general agent patterns. We introduce new concepts to capture the inherent complexities: the \emph{pre-training degree} of new agents quantifies how much information an agent carries upon entry, \emph{stability} measures the impact of new agents on the system, and \emph{global dynamic regret} compares the cumulative expected reward of all active agents with that of the varying optimal arms. We develop certified global-UCB learning methodologies with provable guarantees. Our regret bounds reveal that entry uncertainty enters linearly via the pre-training degree, while in stable regimes, regret is governed by the time needed to identify a persistent optimal arm, as well as by the agent patterns. We further show that these dependencies are tight via lower bounds in hard instances.


Mean-Field Path-Integral Diffusion: From Samples to Interacting Agents

arXiv.org Machine Learning

Independent sample generation is the prevailing paradigm in modern diffusion-based generative models of AI. We ask a different question: can samples coordinate through shared population statistics to transport probability mass more efficiently? We introduce Mean-Field Path-Integral Diffusion (MF-PID), a framework in which samples are promoted to interacting agents whose drift depends self-consistently on the evolving population density. We identify two analytically tractable regimes: a Linear-Quadratic-Gaussian (LQG) benchmark in which the infinite-dimensional mean-field system reduces to a finite set of Riccati and linear ODEs, and a Gaussian-mixture regime governed by a piecewise-constant protocol that preserves closed-form solvability. For a quadratic interaction potential with schedule ฮฒt and zero base drift we prove that the self-consistent MF guidance is the exact linear interpolant between initial and target global means -- a result that holds for arbitrary initial and target densities and any ฮฒt. Applied to demand-response control of energy systems, where agents aggregated into an ensemble are energy consumers (e.g. The energy saving is independent of the number of zones per building (d = 1-32 tested), confirming that the linear guidance formula broadcasts a single d-vector with O(d) communication and grows mildly in compute (sub-cubically for d 32, asymptotically O(d3) for d 1). Introduction Generative AI has been transformed by diffusion models, which frame sample generation as a stochastic process steered from noise to data [1-3]. A key structural feature of these models -- shared with other generative models, e.g. Similarly, stochastic optimal transport (SOT) and Schrรถdinger bridge formulations [6-8] cast distribution matching as an independent-particle path optimization, yielding tractable convolutions of Green functions but discarding inter-particle information; stochastic interpolants [9] construct flexible transport bridges between arbitrary densities via tunable continuous-time stochastic processes, recovering the Schrรถdinger bridge as a special limit -- again in an independent-particle framework.


AlberDICE: Addressing Out-Of-Distribution Joint Actions in Offline Multi-Agent RL via Alternating Stationary Distribution Correction Estimation

Neural Information Processing Systems

One of the main challenges in offline Reinforcement Learning (RL) is the distribution shift that arises from the learned policy deviating from the data collection policy. This is often addressed by avoiding out-of-distribution (OOD) actions during policy improvement as their presence can lead to substantial performance degradation. This challenge is amplified in the offline Multi-Agent RL (MARL) setting since the joint action space grows exponentially with the number of agents. To avoid this curse of dimensionality, existing MARL methods adopt either value decomposition methods or fully decentralized training of individual agents. However, even when combined with standard conservatism principles, these methods can still result in the selection of OOD joint actions in offline MARL. To this end, we introduce AlberDICE, an offline MARL algorithm that alternatively performs centralized training of individual agents based on stationary distribution optimization. AlberDICE circumvents the exponential complexity of MARL by computing the best response of one agent at a time while effectively avoiding OOD joint action selection. Theoretically, we show that the alternating optimization procedure converges to Nash policies. In the experiments, we demonstrate that AlberDICE significantly outperforms baseline algorithms on a standard suite of MARL benchmarks.



Dual Self-Awareness Value Decomposition Framework without Individual Global Max for Cooperative MARL

Neural Information Processing Systems

Value decomposition methods have gained popularity in the field of cooperative multi-agent reinforcement learning. However, almost all existing methods follow the principle of Individual Global Max (IGM) or its variants, which limits their problem-solving capabilities. To address this, we propose a dual self-awareness value decomposition framework, inspired by the notion of dual self-awareness in psychology, that entirely rejects the IGM premise. Each agent consists of an ego policy for action selection and an alter ego value function to solve the credit assignment problem. The value function factorization can ignore the IGM assumption by utilizing an explicit search procedure. On the basis of the above, we also suggest a novel anti-ego exploration mechanism to avoid the algorithm becoming stuck in a local optimum. As the first fully IGM-free value decomposition method, our proposed framework achieves desirable performance in various cooperative tasks.


Statistical and Computational Trade-off in Multi-Agent Multi-Armed Bandits

Neural Information Processing Systems

We study the problem of regret minimization in Multi-Agent Multi-Armed Bandits (MAMABs) where the rewards are defined through a factor graph. We derive an instance-specific regret lower bound and characterize the minimal expected number of times each global action should be explored. This bound and the corresponding optimal exploration process are obtained by solving a combinatorial optimization problem whose set of variables and constraints exponentially grow with the number of agents, and cannot be exploited in the design of efficient algorithms. Inspired by Mean Field approximation techniques used in graphical models, we provide simple upper bounds of the regret lower bound. The corresponding optimization problems have a reduced number of variables and constraints. By tuning the latter, we may explore the trade-off between the achievable regret and the complexity of computing the corresponding exploration process. We devise Efficient Sampling for MAMAB (ESM), an algorithm whose regret asymptotically matches the approximated lower bounds. The regret and computational complexity of ESM are assessed numerically, using both synthetic and real-world experiments in radio communications networks.


Scenario Diffusion: Controllable Driving Scenario Generation With Diffusion

Neural Information Processing Systems

Automated creation of synthetic traffic scenarios is a key part of validating the safety of autonomous vehicles (AVs). In this paper, we propose Scenario Diffusion, a novel diffusion-based architecture for generating traffic scenarios that enables controllable scenario generation. We combine latent diffusion, object detection and trajectory regression to generate distributions of synthetic agent poses, orientations and trajectories simultaneously. To provide additional control over the generated scenario, this distribution is conditioned on a map and sets of tokens describing the desired scenario. We show that our approach has sufficient expressive capacity to model diverse traffic patterns and generalizes to different geographical regions.


Strategic Distribution Shift of Interacting Agents via Coupled Gradient Flows

Neural Information Processing Systems

We propose a novel framework for analyzing the dynamics of distribution shift in real-world systems that captures the feedback loop between learning algorithms and the distributions on which they are deployed.